scSemiGCN: boosting cell-type annotation from noise-resistant graph neural networks with extremely limited supervision

Abstract

Motivation: Cell-type annotation is fundamental to revealing cell heterogeneity in single-cell data analysis. Although many methods have been developed, single-cell RNA-sequencing data have a low signal-to-noise ratio and suffer from batch effects and dropout. This still hinders the discovery of grouped patterns for cell types by unsupervised learning and by its alternative, semi-supervised learning, which uses a few labeled cells as guidance for cell-type annotation.

Results: We propose scSemiGCN, a robust cell-type annotation method based on graph convolutional networks. Built upon a denoised network structure that characterizes reliable cell-to-cell connections, scSemiGCN generates pseudo labels for unannotated cells. Supervised contrastive learning then refines the noisy single-cell data. Finally, message passing with the refined features over the denoised network structure is conducted for semi-supervised cell-type annotation. Comparison with six methods on several datasets under extremely limited supervision validates the effectiveness and efficiency of scSemiGCN for cell-type annotation.

Availability and implementation: Implementation of scSemiGCN is available at https://github.com/Jane9898/scSemiGCN.


Introduction
Single-cell RNA sequencing (scRNA-seq), which measures signals of genetic molecules at cell resolution, enables cell-type stratification to reveal cell heterogeneity. It thereby helps to uncover cell lineages and the composition of complex tissues, providing detailed landscapes of cell fate trajectories and disease progression in organisms (Tang et al. 2009, Treutlein et al. 2014, Han et al. 2020). Exploring scRNA-seq data is therefore fundamental to achieving such biological understanding and clinical applications (Lähnemann et al. 2020, Wu and Zhang 2020).
Cell-type annotation, which distinguishes different types of cells, is a vital step in scRNA-seq data analysis. Traditional annotation methods first build unsupervised learning models to divide cells into subgroups according to the underlying differences in scRNA-seq data, then find the marker genes for each group with differential expression analysis. By matching marker genes with known cell types, subgroups are labeled with annotations (Wu and Zhang 2020). Clustering is the key step in the pipeline, and a number of methods have been proposed to process scRNA-seq data (Grün et al. 2015, Levine et al. 2015, Macosko et al. 2015, Wang et al. 2017). For example, SIMLR learned similarities between cells via multiple kernel learning, then applied spectral clustering to discover subgroups (Wang et al. 2017). SAFE-clustering integrated the outcomes of four popular clustering methods by hypergraph partitioning algorithms to obtain a consensus result (Yang et al. 2019). With the advantages of deep models in representation learning, a number of cell-type annotation methods based on deep networks have been developed. Wang et al. (2021a) proposed a deep learning framework that iterated within multiple autoencoders to learn graph embeddings of cells, and then obtained cell clusters by k-means and Louvain. Tian et al. (2019) trained an autoencoder with both a zero-inflated negative binomial loss and KL-divergence to learn low-dimensional embeddings and clustering assignments simultaneously.
Although unsupervised methods are label-free, they require expertise to find marker genes or reference databases for a specific cell type (Shao et al. 2021). Decoupling clustering and annotation in the learning process may also lead to biologically meaningless subgroups. Cell-type annotation methods based on semi-supervised learning have emerged as an economical way to tackle these issues (Kim et al. 2019, Chen et al. 2021, Wei and Zhang 2021, Dong et al. 2022, 2023, Xu et al. 2022, Seal et al. 2023). Dong et al. (2023) employed word2vec to learn gene embeddings, which were fed into branch bidirectional LSTM networks with a shared module. They then trained their model with labeled and unlabeled data in a multi-task learning manner. Similarly, Xu et al. (2022) proposed scSemiGAN, which consisted of generative adversarial networks and a decoder to obtain cell-type identities and latent representations of cells using labeled data as additional supervised signals. Wei and Zhang (2021) proposed to annotate unlabeled cells by alternately updating a logistic regression model and spectral clustering: the former acted as a predictive model while the latter generated pseudo labels of cells under a consistency constraint. These semi-supervised cell-type annotation methods show promising results, but they can be further improved by taking the quality of scRNA-seq data into account.
Owing to technical artifacts of scRNA-sequencing, scRNA-seq data are contaminated with a high level of noise caused by sequencing depth, experimental designs, operations, etc. (Lähnemann et al. 2020). Apart from technical issues, cell-type annotation is also plagued with biological challenges. For example, transient biological states bring ambiguity in cell-type identification (Kiselev et al. 2019), and skewed distributions of cell types make it difficult to capture patterns of rare categories. Thus, using raw scRNA-seq data directly for analysis generally leads to unsatisfactory outcomes. To ensure trustworthy discoveries, efforts have been made to handle batch effects and dropout in scRNA-seq data (Huang et al. 2018, Wei and Li 2018, Korsunsky et al. 2019, Yu et al. 2023).
In this paper, we present a semi-supervised method for cell-type annotation called scSemiGCN, based on multi-layer graph convolutional networks (GCN) (Kipf and Welling 2017). GCN has been employed for scRNA-seq data analysis for its outstanding ability to capture complex and high-order connections in networks (Wang et al. 2021a,b, Gao et al. 2023, Lewinsohn et al. 2023). By representing cells as nodes in a network, a holistic topological relationship between cells is built by message passing in a forward GCN. The adjacency matrix in GCN, usually constructed from scRNA-seq data to depict the relationship between cells, is crucial in the transmission of information. However, it may be unreliable due to the low signal-to-noise ratio of scRNA-seq data and thus impair the learning process, yet few methods have taken this into account.
To address this issue, we apply SIMLR to learn similarities between cells and subsequently employ Network Enhancement (NE) (Wang et al. 2018) as a denoising procedure that diminishes suspicious connections and strengthens strong links. By replacing the two-sided normalized transmission matrix in GCN with the denoised similarity matrix, which achieves a favorable eigengap, scSemiGCN ensures a discriminative structure in the cell-to-cell network, which helps to improve the predictive power. Additionally, to achieve better representations of nodes in the network as initial features for GCN, the raw scRNA-seq data are projected onto a discriminative representation space by supervised contrastive learning (Khosla et al. 2020), where cells of the same type lie close together and cells of different types are far apart. To this end, all cells must be annotated beforehand. scSemiGCN therefore first generates pseudo labels for unlabeled cells by k-nearest neighbors (KNN), leveraging the denoised similarity matrix and only a few labeled cells. With the advantage of an enhanced network structure and discriminative initial features, we finally attain a powerful two-layer GCN for cell-type annotation that is learned with a small number of annotated cells.
To summarize, our contributions are as follows:
• We propose scSemiGCN, consisting of topological denoising and feature refinement, to handle low signal-to-noise-ratio scRNA-seq data for semi-supervised cell-type annotation. The framework of scSemiGCN is shown in Fig. 1.
• By applying a denoising procedure to cell-to-cell similarities, we obtain a more reliable network structure from which we generate pseudo labels and build a denoised GCN.
• We then refine scRNA-seq data by supervised contrastive learning with pseudo labels using the denoised GCN as backbone. Subsequently, we learn a denoised GCN for cell-type annotation with refined features and extremely limited supervision.
• We evaluate scSemiGCN on six real scRNA-seq datasets and a more challenging continuum dataset by comparing with semi-supervised and unsupervised methods. Experimental results show its competitive or even better performance over competing methods using only five percent of labeled cells.

Notations and preliminaries
We denote an scRNA-seq expression matrix as $X \in \mathbb{R}^{m \times n}$, which contains expression signals of $n$ genes in $m$ cells. The lowercase bold symbol $\mathbf{x}_i \in \mathbb{R}^{n}$ denotes the expression signals of cell $i$. Suppose there are $m_l$ annotated cells and $m_u$ unannotated cells. The expression matrix $X$ can be written as $X = [X_l; X_u]$, where $X_l \in \mathbb{R}^{m_l \times n}$ and $X_u \in \mathbb{R}^{m_u \times n}$ are the expression matrices of annotated and unannotated cells, respectively. Considering there are $c$ types of cells, let $Y_l \in \{0, 1\}^{m_l \times c}$ represent the cell-type indication matrix of $X_l$, where the cell types are denoted by one-hot coding in rows. We aim to infer the cell-type indication matrix $Y_u \in \{0, 1\}^{m_u \times c}$ of the $m_u$ unannotated cells.

Here we propose scSemiGCN to achieve this goal. scSemiGCN consists of three stages. First, we generate pseudo labels for unannotated cells with KNN by leveraging a denoised similarity matrix. Then we refine the low signal-to-noise-ratio scRNA-seq data by projecting it onto a discriminative representation space in a supervised contrastive learning paradigm. Finally, we train a two-layer GCN with labeled cells for cell-type annotation using a more reliable topological network structure and discriminative features as input. Details are presented in the following sections.

Generate pseudo labels with topological denoising
Cell-to-cell similarities. We employ SIMLR (Wang et al. 2017) to learn cell-to-cell similarities. SIMLR returns a similarity matrix $S \in \mathbb{R}_{+}^{m \times m}$ by alternating optimization of its objective, Equation (1), in which $I_m$ and $I_d$ are $m \times m$ and $d \times d$ identity matrices, respectively. The symbols $d$, $\beta$, $\rho$, and $\gamma$ denote non-negative hyperparameters. The parameter $d$ can be set as the number of desired clusters in the dataset. Both $\beta$ and $\gamma$ are estimated by a data-driven approach and $\rho$ is set as m 2. Modified Gaussian kernels with different hyperparameters are used to define a series of kernels $\{K_l\}_{l=1}^{L}$, each of which takes the form of Equation (2). The scale parameter $\epsilon_{i,j}^{(l)}$ is computed by Equation (3), in which $N_{\mathbf{x}_i}^{(l)}$ is the set of the top $k^{(l)}$ nearest neighbors of $\mathbf{x}_i$ in Euclidean distance. By varying $(k^{(l)}, \sigma^{(l)})$, we obtain multiple kernels $\{K_l\}_{l=1}^{L}$. Following Wang et al. (2017), we generate 55 kernels by setting $k^{(l)} \in \{10, 12, 14, \ldots, 30\}$ and $\sigma^{(l)} \in \{1.0, 1.25, 1.50, \ldots, 2\}$.
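The kernel construction can be sketched in NumPy. The block below is an illustration, not the SIMLR implementation. It assumes the kernel form described by Wang et al. (2017), namely $K(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{\epsilon_{i,j}\sqrt{2\pi}} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\epsilon_{i,j}^2)\right)$ with $\epsilon_{i,j} = \sigma(\mu_i + \mu_j)/2$, where $\mu_i$ is the mean distance from $\mathbf{x}_i$ to its $k$ nearest neighbors. The function name `multiple_kernels` and its arguments are hypothetical.

```python
import numpy as np

def multiple_kernels(X, ks=range(10, 31, 2), sigmas=(1.0, 1.25, 1.5, 1.75, 2.0)):
    """Build 11 x 5 = 55 Gaussian kernels (assumed SIMLR-style form).

    X is an (m, n) expression matrix with m > max(ks) cells.
    """
    sq = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances, clipped at 0 against rounding error.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    D = np.sqrt(D2)
    D_sorted = np.sort(D, axis=1)  # column 0 is the distance to the cell itself
    kernels = []
    for k in ks:
        # Mean distance to the k nearest neighbors, excluding the cell itself.
        mu = D_sorted[:, 1:k + 1].mean(axis=1)
        for sigma in sigmas:
            eps = sigma * (mu[:, None] + mu[None, :]) / 2.0
            K = np.exp(-D2 / (2.0 * eps ** 2)) / (eps * np.sqrt(2.0 * np.pi))
            kernels.append(K)
    return kernels
```

The grid of 11 neighborhood sizes and 5 bandwidths reproduces the count of 55 kernels stated above. How SIMLR weights and combines these kernels is not shown here.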
Topological denoising. Kernels calculated by Equation (2) are still unreliable because of the high dimensionality and high noise level of $\mathbf{x}_i$, which produce ambiguous neighbors in Euclidean space and hence may undermine the similarity matrix $S$. Wang et al. (2017) proposed a diffusion step for $S$ to alleviate this tendency. We instead apply the more powerful Network Enhancement (NE) (Wang et al. 2018), which provides provable guarantees through spectral analysis of the transition matrix in random walks. Specifically, NE defines the transition matrix

$$P_{i,j} = \frac{S_{i,j}}{\sum_{k \in N_i} S_{i,k}} \, I_{\{j \in N_i\}}, \qquad T_{i,j} = \sum_{k=1}^{m} \frac{P_{i,k} P_{j,k}}{\sum_{v=1}^{m} P_{v,k}}, \tag{4}$$

where $N_i$ is the set of the $K$ nearest neighbors of cell $i$ and $I_{\{\cdot\}}$ denotes an indicator function. The similarity matrix is updated by random walks:

$$S_{t+1} = \alpha T S_t T + (1 - \alpha) T, \tag{5}$$

where we initialize $S_0$ with the matrix $S$ returned by SIMLR, and $\alpha$ is a regularization parameter for restart. It can be shown that Equation (5) converges to an equilibrium graph, i.e.

$$\lim_{t \to \infty} S_t = (1 - \alpha) T (I_m - \alpha T^2)^{-1}, \tag{6}$$

where $I_m$ is an $m \times m$ identity matrix. Thus we can obtain a denoised similarity matrix as

$$\bar{S} = (1 - \alpha) T (I_m - \alpha T^2)^{-1}. \tag{7}$$

Given the eigen-decomposition of the transition matrix $T = U \Sigma U^{-1}$, where $\Sigma$ is a diagonal matrix with the eigenvalues of $T$ as diagonal elements and $U$ consists of the corresponding eigenvectors as columns, by Equation (7), we have

$$\bar{S} = U \left[ (1 - \alpha) \Sigma (I_m - \alpha \Sigma^2)^{-1} \right] U^{-1}. \tag{8}$$

Hence, the denoised similarity matrix $\bar{S}$ can be computed by Equation (8) instead of the iterative form in Equation (5), with computational complexity $O(m^3)$.
From Equation (8), it is proved that $\bar{S}$ obtains a larger eigengap than $S$ has [cf. Lemma 3 in Wang et al. (2018)], which results in a more discriminative similarity metric (or network structure).
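The closed-form denoising step can be illustrated with a short NumPy sketch. It is not the authors' code: it assumes the Network Enhancement definitions of Wang et al. (2018) (a row-normalized KNN matrix $P$, the transition matrix $T = P\,\mathrm{diag}(\mathbf{1}^T P)^{-1} P^T$, and the equilibrium spectrum $(1-\alpha)\lambda/(1-\alpha\lambda^2)$), it assumes that each cell belongs to its own neighborhood and that $S$ has a positive diagonal, and the function names are hypothetical.

```python
import numpy as np

def transition_matrix(S, K):
    """Transition matrix T built from the K most similar cells of each row."""
    m = S.shape[0]
    ranked = S.astype(float).copy()
    np.fill_diagonal(ranked, np.inf)          # assumption: a cell is its own neighbor
    idx = np.argsort(-ranked, axis=1)[:, :K]
    mask = np.zeros((m, m), dtype=bool)
    mask[np.arange(m)[:, None], idx] = True
    P = np.where(mask, S, 0.0)
    P = P / P.sum(axis=1, keepdims=True)      # each row of P sums to 1
    return (P / P.sum(axis=0)) @ P.T          # T[i, j] = sum_k P[i,k] P[j,k] / colsum[k]

def network_enhancement(S, K=20, alpha=0.5):
    """Equilibrium graph computed from the eigen-decomposition of T (0 <= alpha < 1)."""
    T = transition_matrix(S, K)
    lam, U = np.linalg.eigh((T + T.T) / 2.0)  # T is symmetric, so eigh applies
    lam = np.clip(lam, 0.0, 1.0)              # remove tiny numerical excursions
    spectrum = (1.0 - alpha) * lam / (1.0 - alpha * lam ** 2)
    return (U * spectrum) @ U.T
```

Because $T$ is symmetric, $U^{-1} = U^T$ and a symmetric eigensolver can be used. On small inputs the result can be compared with repeated application of the random-walk update to confirm that both routes agree.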
Preliminary annotation. The entry of $\bar{S}$, denoted as $\bar{s}_{i,j}$, indicates the similarity between cell $i$ and cell $j$. Given a denoised similarity matrix $\bar{S}$, we can generate pseudo labels for unannotated cells by KNN over the labeled cells, using $\bar{S}$ as the similarity metric. Formally, for an unlabeled cell $i$, let $N_k^L(i) = \{\text{top } k \text{ labeled cells most similar to cell } i\}$; then the one-hot coding of the pseudo label of cell $i$ is $\tilde{\mathbf{y}}_i = \mathrm{mode}(\{\mathbf{y}_j \mid \text{cell } j \in N_k^L(i)\})$, where $\mathbf{y}_j$ represents the ground-truth label of cell $j$ in one-hot coding and $\mathrm{mode}(\cdot)$ denotes the majority voting operator. In our experiments, we simply set the number of nearest neighbors to $k = 1$.
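A minimal sketch of this pseudo-labeling rule follows. It uses integer class labels instead of one-hot vectors for brevity, marks unlabeled cells with -1, and breaks voting ties in favor of the smallest class index. All three choices are assumptions of the sketch, and the function name `pseudo_labels` is hypothetical.

```python
import numpy as np

def pseudo_labels(S_bar, labels, labeled_idx, k=1):
    """Assign each unlabeled cell the majority label of its k most similar labeled cells.

    S_bar       : (m, m) denoised similarity matrix
    labels      : length-m integer array, -1 for unlabeled cells
    labeled_idx : indices of cells with ground-truth labels
    """
    labels = np.asarray(labels)
    labeled_idx = np.asarray(labeled_idx)
    out = labels.copy()
    unlabeled = np.setdiff1d(np.arange(S_bar.shape[0]), labeled_idx)
    for i in unlabeled:
        sims = S_bar[i, labeled_idx]                 # similarity to labeled cells only
        top = labeled_idx[np.argsort(-sims)[:k]]     # k most similar labeled cells
        out[i] = np.bincount(labels[top]).argmax()   # majority vote
    return out
```

With k = 1, as used in the experiments, the vote reduces to copying the label of the single most similar labeled cell.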

Refine scRNA-seq data by supervised contrastive learning with a denoised GCN
In this section, we propose to refine the scRNA-seq data $X$ with supervised contrastive learning (SCL), using the precise labels of $X_l$ and the pseudo labels of $X_u$ for supervision. Each cell is projected onto a discriminative representation space where cells of the same type lie together and cells of different types are far apart, with the dimensionality unchanged. We define the projection as a one-layer GCN, i.e.

$$\bar{X} = \mathrm{ReLU}(A X \Phi), \tag{9}$$

where $\Phi \in \mathbb{R}^{n \times n}$ is a learnable weight matrix and $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$.

Vanilla GCN. In the vanilla graph convolutional network, the two-sided normalized symmetric matrix $A$ that reveals the topological structure of the network is defined as $A = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A}$ is the adjacency matrix with self-connections and $\tilde{D}$ is a diagonal matrix with diagonal elements $\tilde{D}_{i,i} = \sum_{j=1}^{m} \tilde{A}_{i,j}$.

Denoised GCN. From Equations (4) and (5), Wang et al. (2018) showed that $S_{t+1}$ remains a two-sided normalized symmetric matrix in each iteration, i.e. $S_{t+1} \mathbf{1} = \mathbf{1}$ and $S_{t+1}^{T} = S_{t+1}$, with non-negative elements in $S_{t+1}$. Thus, the limit of $S_{t+1}$, i.e. $\bar{S}$, is still a two-sided normalized symmetric matrix. Here, the bold symbol $\mathbf{1}$ indicates an all-one vector of dimensionality $m$, i.e. $\mathbf{1} = (1, 1, \ldots, 1)^{T}$. Further, the following theorem indicates the range of eigenvalues of the limit.
Theorem 1 The eigenvalues of the equilibrium graph in Equation (7) fall into the range $[0, 1]$.

Proof. Wang et al. (2018) have shown that $T$ is positive semi-definite and two-sided normalized with non-negative entries. Let $\lambda$ be an eigenvalue of $T$. By the Gershgorin circle theorem, there is an index $i$ such that $|\lambda - T_{i,i}| \le \sum_{j \ne i} |T_{i,j}|$, which implies $\lambda \le T_{i,i} + \sum_{j \ne i} |T_{i,j}| = \sum_{j=1}^{m} T_{i,j} = 1$ since $T \mathbf{1} = \mathbf{1}$ and $T_{i,j} \ge 0$. Because $T$ is positive semi-definite, we conclude that the eigenvalues of the transition matrix $T$ lie in $[0, 1]$. By Equation (8), each eigenvalue of $\bar{S}$ can be represented as $\frac{(1 - \alpha)\lambda}{1 - \alpha \lambda^2}$, where $\lambda$ denotes an eigenvalue of $T$. Since $\alpha \in [0, 1]$, we conclude our statement. $\square$

In the above proof, we simplify the claim in Wang et al. (2018) by using the Gershgorin circle theorem.
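The last step of the proof can be spelled out as a short check. It assumes $\alpha < 1$ strictly, so that the denominator stays positive; this strictness is an assumption added here for the boundary case $\lambda = 1$.

```latex
\begin{aligned}
&0 \le \lambda \le 1,\quad 0 \le \alpha < 1
  \;\Longrightarrow\; 1 - \alpha\lambda^{2} \ge 1 - \alpha > 0
  \;\Longrightarrow\; \frac{(1-\alpha)\lambda}{1-\alpha\lambda^{2}} \ge 0, \\
&(1 - \alpha\lambda^{2}) - (1-\alpha)\lambda = (1-\lambda)(1+\alpha\lambda) \ge 0
  \;\Longrightarrow\; \frac{(1-\alpha)\lambda}{1-\alpha\lambda^{2}} \le 1 .
\end{aligned}
```

Both bounds together place every eigenvalue of the equilibrium graph in $[0, 1]$, and the eigenvalue $\lambda = 1$ of $T$ is mapped to 1.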
In short, $\bar{S}$ is a two-sided normalized symmetric matrix with eigenvalues in the range $[0, 1]$. Thus, it can be used as $A$ in Equation (9), since it naturally avoids numerical instability or extreme gradients in GCN (Kipf and Welling 2017) and captures the denoised network structure simultaneously. In the end, Equation (9) can be rewritten as

$$\bar{X} = \mathrm{ReLU}(\bar{S} X \Phi). \tag{10}$$

The refined representation of cell $i$ is indicated by $\bar{\mathbf{x}}_i$, corresponding to the $i$-th row of $\bar{X}$.
Supervised contrastive loss. Once we obtain the refined representations of cells by Equation (10), the unknown parameter $\Phi$ is optimized by minimizing the supervised contrastive loss in Equation (11), in which $\mathcal{B}$ denotes the set of samples in a mini-batch, $P_i$ and $M_i$ are the sets of samples in the mini-batch with the same label as cell $i$ and with labels different from that of cell $i$, respectively, and $\bar{\mathbf{z}}_i$ is the unit normalization of $\bar{\mathbf{x}}_i$, i.e. $\bar{\mathbf{z}}_i = \bar{\mathbf{x}}_i / \|\bar{\mathbf{x}}_i\|_2$. The symbol $\tau$ in Equation (11) denotes the tunable temperature, which we set as 0.5 in our experiments, and $|P_i|$ counts the number of cells in $P_i$.
It should be noted that we only train a few steps with the supervised contrastive loss to learn the refined representations of cells in a bid to improve robustness since the labels of unannotated cells are not accurate in the current stage.
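For orientation, a supervised contrastive loss can be sketched in NumPy as below. This sketch follows the general formulation of Khosla et al. (2020), in which the denominator runs over all other samples in the mini-batch. The exact form used by scSemiGCN may differ, so the block should be read as an assumption rather than as the paper's loss, and the function name `supcon_loss` is hypothetical.

```python
import numpy as np

def supcon_loss(X_bar, labels, tau=0.5):
    """Supervised contrastive loss over one mini-batch (Khosla-style sketch).

    X_bar  : (b, n) refined representations of the cells in the batch
    labels : length-b integer array of true or pseudo labels
    """
    labels = np.asarray(labels)
    norms = np.maximum(np.linalg.norm(X_bar, axis=1, keepdims=True), 1e-12)
    Z = X_bar / norms                                   # unit-normalized features
    logits = Z @ Z.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    not_self = ~np.eye(len(labels), dtype=bool)
    exp = np.exp(logits) * not_self                     # exclude the anchor itself
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                                   # anchors with at least one positive
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.mean()
```

The loss is small when cells sharing a label have similar normalized features and large otherwise. Anchors without any same-label partner in the batch are skipped, so a batch with no positive pairs at all yields `nan`.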

Annotate cell types with limited supervision
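With refined representations $\bar{X}$ and a denoised network structure $\bar{S}$ available, we finally build a two-layer GCN for cell-type annotation, i.e.

$$\hat{Y} = \mathrm{softmax}\left( \bar{S} \, \mathrm{ReLU}(\bar{S} \bar{X} W_1) \, W_2 \right), \tag{12}$$

where $W_1 \in \mathbb{R}^{n \times h}$ and $W_2 \in \mathbb{R}^{h \times c}$ are learnable parameters, collectively denoted by $W$. Here we use a denoised GCN with the same motivation as in Equation (10). Each row of $\hat{Y}$ represents the predicted cell-type probability distribution of a cell. The model is optimized solely by the cross-entropy loss over labeled cells $X_l$, i.e.

$$\mathcal{L}(W) = -\sum_{i \in \mathcal{B}_l} \mathbf{y}_i^{T} \log \hat{\mathbf{y}}_i, \tag{13}$$

where $\mathcal{B}_l$ is a mini-batch of cells with precise labels, and $\mathbf{y}_i$ and $\hat{\mathbf{y}}_i$ are the ground-truth annotation of cell $i$ in one-hot coding and the corresponding predicted cell-type probability distribution, respectively.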
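The forward pass and the loss over labeled cells can be sketched as follows. This is a NumPy illustration only: it assumes a row-wise softmax output and an unnormalized sum over the labeled mini-batch, the names `gcn_forward` and `cross_entropy` are hypothetical, and parameter updates (Adam in the experiments) are not shown.

```python
import numpy as np

def gcn_forward(S_bar, X_bar, W1, W2):
    """Two-layer GCN: propagate over S_bar, ReLU, propagate again, row-wise softmax."""
    H = np.maximum(0.0, S_bar @ X_bar @ W1)        # hidden layer of width h
    logits = S_bar @ H @ W2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)        # each row is a distribution over c types

def cross_entropy(Y_hat, Y_onehot, batch_idx):
    """Cross-entropy summed over a mini-batch of labeled cells only."""
    return -np.sum(Y_onehot[batch_idx] * np.log(Y_hat[batch_idx] + 1e-12))
```

Predictions are produced for every cell because propagation runs over the whole graph, while only the labeled rows enter the loss. This is what makes the training semi-supervised.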

Competing methods and datasets
Four recently developed cell-type annotation methods were used for comparison with scSemiGCN: CALLR (Wei and Zhang 2021), scSemiGAN (Xu et al. 2022), scSemiAE (Dong et al. 2022), and SIMLR (Wang et al. 2017). We also compared our denoised GCN with the vanilla GCN (Kipf and Welling 2017) and Graph Attention Networks (GAT) (Veličković et al. 2018), which have been widely used for semi-supervised node classification in graph learning. All of the competing methods are semi-supervised, except the unsupervised SIMLR. For SIMLR, we used the similarity matrix it returned to annotate the unlabeled cells with KNN for comparison. We ran our experiments with six single-cell RNA-seq datasets. Five of them, namely Buettner, Kolodziejczyk, Pollen, Usoskin, and Zeisel, were taken from Wang et al. (2017), and the remaining one, Cortex, was created by Dong et al. (2022). Detailed descriptions of these datasets are presented in Table 1.

Experimental settings
In our experiments, only five percent of the cells in each dataset were treated as annotated during training, and the rest were evenly divided into validation and test sets. The number of annotated cells used in training for each cell type in each dataset is summarized in Supplementary Table S1. Accuracy, F1-score, and the area under the ROC curve (AUC) of the predicted annotations on the test sets are reported.
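The splitting protocol can be sketched as below. This is an assumed implementation: the paper does not state the rounding rule, the random seed, or whether the split was stratified by cell type, so the plain random permutation and the name `split_cells` are illustrative only.

```python
import numpy as np

def split_cells(m, train_frac=0.05, seed=0):
    """Randomly split m cells into a small labeled training set and equal
    validation and test sets (not stratified; an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_train = max(1, int(round(train_frac * m)))
    rest = perm[n_train:]
    half = len(rest) // 2
    return perm[:n_train], rest[:half], rest[half:]
```

For a dataset of 4223 cells, this rule gives 211 training cells, which matches the number of labeled cells reported for TICA-3C later in the paper.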
We apply SIMLR to learn cell-to-cell similarities with the default settings of its implementation (Wang et al. 2017). How the choice of SIMLR's parameters affects scSemiGCN is investigated and presented in Supplementary Figs S1 and S2. The regularization parameter $\alpha$ and the neighborhood size $K$ in network enhancement, i.e. Equation (7), need to be tuned in experiments. We selected $\alpha$ from $\{0.4, 0.5, 0.6\}$. The range of the neighborhood size $K$ was set according to the average number of cells in each cell type. In Buettner and Pollen, it was chosen between 18 and 20, and between 20 and 22 for the rest. The combination of $\alpha$ and $K$ was determined by the highest accuracy on the validation data for each dataset.
We set the input dimension $n$ of the GCNs to the number of genes in all datasets except Cortex, where we selected the top 2000 most variable genes as input. The dimensionality of the hidden layer $h$ in Equation (12) was fixed at 100. The size of the mini-batch $\mathcal{B}_l$ in Equation (13) was set to 100.
In stage II, we trained a denoised GCN for 10 epochs by supervised contrastive learning, using SGD as the optimizer with a learning rate of 0.05. In the final stage, we trained the two-layer GCN for 400 epochs using Adam as the optimizer, with the learning rate set to 0.001 for Buettner, Pollen, and Cortex, and 0.0005 for the rest.

Performance in cell-type annotation
We report the comparison between scSemiGCN and the competing methods for cell-type annotation under three metrics. Results are summarized in Table 2. AUC is not reported for CALLR and scSemiAE since they returned predicted labels without probability estimates. scSemiGCN demonstrates competitive and even dominant performance in all six datasets, showing its favorable robustness and adaptability. For SIMLR, we annotated cells by KNN using the learned similarities, where the neighborhood size $k$ for annotation was selected from $\{1, 3, 5\}$ using the validation data. CALLR requires at least two annotated samples per cell type in training. Hence, we used 10% of annotated cells when running CALLR on Pollen.
There is an obvious gap between scSemiGCN and the other methods in Pollen. It should be noted that at most two annotated cells were used in training for each cell type in this dataset (see Supplementary Table S1). The result implies that scSemiGCN is highly effective in this extremely limited supervision scenario.
Comparing SIMLR and scSemiGCN, we see that scSemiGCN generally improves on SIMLR except in Buettner. The improvement is particularly significant in Pollen and is brought by both feature refinement and topological denoising, as clarified in the following ablation studies. GCN outperforms GAT in four out of six datasets, which suggests that GCN is a better graph-neural-network-based backbone than GAT for scSemiGCN.
We visualize latent representations generated by neural-network-based methods in three datasets with t-SNE (van der Maaten and Hinton 2008) in Fig. 2. Different categories are well separated in Kolodziejczyk by all methods. For the larger and more diverse Zeisel and Cortex, cell types with a larger proportion are more easily identified, while rare types tend to be mixed with others. Nevertheless, the rare cell type of Zeisel indicated in green is better separated by scSemiGCN. Visualizations of the remaining datasets can be found in Supplementary Fig. S3.

Ablation studies and analysis
We also studied the effectiveness of feature refinement and topological denoising. To this end, we ran scSemiGCN without stage II (removing supervised contrastive learning) and without denoising in stage I (removing network enhancement), respectively, while keeping the remaining experimental settings unchanged. When skipping denoising in stage I, we correspondingly used the vanilla GCN as the backbone instead. We present the results in Fig. 3. Feature refinement helps in all datasets except Zeisel, where there is only a marginal difference between our full model and the model without supervised contrastive learning. Network enhancement clearly boosts scSemiGCN in four out of six datasets, and the improvement is significant in Pollen and Zeisel. For example, network enhancement improves accuracy by nearly 10% in Pollen. We conclude that the combination of feature refinement and topological denoising gives scSemiGCN its robustness and adaptability. Hence, we believe scSemiGCN is widely applicable to scRNA-seq data.

Effect of the number of annotated cells
To investigate the impact of the proportion of annotated cells used in training on scSemiGCN, we varied the ratio of annotated cells from 5% to 45%. For each annotated ratio, we repeated the random data split five times. The accuracy and AUC on the test data are reported. Results for Usoskin and Zeisel are shown in Fig. 4. scSemiGCN performs better when more labeled data are used for training, but the gain is marginal because scSemiGCN already annotates well using only a small proportion of labeled cells.

Parameter analysis
Network enhancement is at the core of scSemiGCN. To study how scSemiGCN is affected by the regularization parameter $\alpha$ and the neighborhood size $K$ in NE, we report the accuracy on validation data under different settings of these two parameters in Fig. 5. AUC and F1-score are shown in Supplementary Figs S4 and S5. We observe that scSemiGCN is more sensitive to the neighborhood size $K$ in Pollen than it is in the remaining three datasets, and we posit that this is attributable to the diversity of Pollen. We also notice that scSemiGCN seems to be more stable in a larger dataset under various settings of the regularization parameter $\alpha$, implying that more effort is needed to find an appropriate $\alpha$ for a smaller dataset.

Cell-type annotation for continuum immune cells
Additionally, we validated our method on a more challenging dataset built from the downsampled Tumor Immune Cell Atlas (TICA) (Nieto et al. 2021). The constructed dataset consisted of 4223 cells from three cancer types, namely intrahepatic cholangiocarcinoma (ICC), ovarian cancer (OC), and non-small-cell lung cancer (NSCLC), and included 25 immune cell types. We labeled this dataset as TICA-3C. The top 2000 most variable genes were used. We followed the previously described data splitting. Only 211 labeled cells were used for training (see Supplementary Table S2) and the rest were used for validation and test. We trained scSemiGCN on TICA-3C with the same hyperparameter setup as on Cortex, except for the neighborhood size in NE and the learning rate for training the two-layer GCN, which were set to 50 and 0.002, respectively. SIMLR and all deep learning methods performed poorly on the demanding TICA-3C, with scSemiGCN achieving the highest accuracy of 0.4432 (see Supplementary Table S3). We visualize the latent representations learned by scSemiGCN. As shown in Fig. 6, the complex and diverse T cells are mixed while the simpler B cells are better separated. This observation is also verified by the confusion matrix (see

Figure 1. Framework of scSemiGCN. It consists of three stages: (i) generating pseudo labels for unannotated cells with denoised similarities in k-nearest neighbors (KNN); (ii) projecting raw features onto a discriminative representation space by supervised contrastive learning; (iii) training a cell-type annotation model with labeled cells in a two-layer graph convolutional network (GCN) using refined features and the denoised network structure as inputs.

Figure 2. Visualization of latent representations generated by neural-network-based methods. Cell types are indicated by colors. Even though there is no significant difference in separation among these methods, scSemiGCN is better at spotting a rare cell type in Zeisel, indicated by red boxes. (a) Kolodziejczyk; (b) Zeisel; (c) Cortex.

Figure 4. Influence of annotated ratio on the performance of scSemiGCN. For each ratio, we repeated the random data split five times. Accuracy and AUC on test data are presented (mean±SD). (a) Usoskin; (b) Zeisel.

Table 2. Summary of evaluation metrics for each method in test data of each dataset.

Figure 3. Effectiveness of feature refinement and topological denoising. We ran scSemiGCN without supervised contrastive learning (w/o SCL) and without network enhancement (w/o NE), respectively, in comparison with our full model scSemiGCN. Performance on the test sets of all datasets is presented. Note that scSemiGCN and w/o NE cannot be told apart in Kolodziejczyk.